Introduction

Main findings:

Analysis design

Question: “How have COVID hotspots changed over time?”

We are interested in understanding the spatial and temporal characteristics of the spread of COVID-19 in the contiguous US, e.g “Where and when do COVID cases and deaths increase or decrease?” To that end, we utilize two approaches to investigate this hypothesis.

Analysis

Part 1: Spatio-temporal heatmapping of cases and deaths across the contiguous US

Data cleaning

library(tidyverse)
library(lubridate)
JHUdata_cases <- read_csv("https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")

JHUdata_deaths <- read_csv("https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv")
  • Both datasets have the same structure that is somewhat problematic.
    • The dates are column headings rather than grouped into a single variable.
    • The dataset contains county-level data that will be cleaned when we do state-level plotting.
JHUdata_deaths %>% head()
JHUdata_cases %>% head()
  • Perform the same set of cleaning steps:
    • Use pivot_longer with a regex to collapse all the date columns into a date variable.
    • Rename columns to simpler names.
    • Parse the date field from chr to date.
    • Collapse county-level data into state-level data by using group_by and summarize, since we are interested first in state-level trends.
      • This variable is added as cumul for cumulative state-level data.
cases <- JHUdata_cases %>% 
  pivot_longer(matches("[0-9]+/[0-9]+/[0-9]+"),names_to="date",values_to="cumul") %>% 
  rename(c("state"="Province_State")) %>% 
  mutate(date=mdy(date)) %>% 
  group_by(date,state) %>% summarize(cumul=sum(cumul))

deaths <- JHUdata_deaths %>% 
  pivot_longer(matches("[0-9]+/[0-9]+/[0-9]+"),names_to="date",values_to="cumul") %>% 
  rename(c("state"="Province_State")) %>% 
  mutate(date=mdy(date)) %>% 
  group_by(date,state) %>% summarize(cumul=sum(cumul))
  • However, we would like to plot new cases over time and the dataset doesn’t contain a “new cases” column. So we need to get a first-order difference of the cumulative time-series.
    • dplyr lag didn’t function properly and returned NAs. So, an in-house function to compute first-order differences was written.
    • first_diff simply iterates through a time series and appends the difference to the correct slot in a vector.
    • add_firstdiff is meant to take a national dataframe and get the 1st order difference for each state.
      • We do this by iterating over every state, filtering the dataframe to get that state, and creating a list of first order differences for each state.
      • Then we combine them all into one dataframe using reduce and full_join.
# Cumulative plot!
# Not what we want.
cases %>% filter(state=="Virginia") %>% 
  ggplot(aes(x=date,y=cumul)) +
  geom_line()

first_diff <- function(arr){
  to_return <- NULL
  for (i in 2:length(arr)){
   to_return[i] <- arr[i] - arr[i-1]
  }
  return(to_return)
}

add_firstdiff <- function(df){
  unique_states <- unique(df$state)
  dfs <- list()
  for (this_state in unique_states){
    thisdf <- df %>% filter(state==this_state)
    thisdf$diff <- first_diff(thisdf$cumul)
    dfs[[length(dfs) + 1]] <- thisdf
  }
  return(dfs %>% reduce(full_join))
}
  • The firstdiff functions are applied to the case and deaths dataframes to get the new cases and new deaths per day.
    • They are then merged together with left_join to get a master dataframe.
    • Only the relevant columns (date, state, new cases/new deaths) are selected.
    • The state variable is converted to lower-case to facilitate joining with GeoJSON data later.
    • We filter out all days before Feb 02, 2020 as there were not many cases in January (relative to the largest spikes in the pandemic), so they will not show up very clearly on the heatmap.
      • Also, it makes the time intervals convenient to depict because then we can consider whole months beginning from the 1st day of every month.
    • We make sure to filter out regions that are not US states.
      • The dataset also includes information about US territories (and even a cruise ship!)
cases <- add_firstdiff(cases) %>% 
  select(date,state,diff) %>% 
  rename(c("positiveIncrease"="diff"))
deaths <- add_firstdiff(deaths) %>% 
  select(date,state,diff) %>% 
  rename(c("deathIncrease"="diff"))

data <- left_join(cases,deaths,by=c("date","state")) %>% 
  mutate(state=tolower(state)) %>% 
  filter(date >= date("2020-02-01"))

data <- data %>% filter(state %in% tolower(state.name))
# Our difference function worked!
cases %>% filter(state=="Virginia") %>% 
  ggplot(aes(x=date,y=positiveIncrease)) +
  geom_line()

data %>% ggplot(aes(x=date,y=positiveIncrease)) + geom_line(aes(color=state),alpha=0.3)

  • We want to get the dates that our plots will range over.
    • We’ll generate one plot for each month of the pandemic. So use seq to interpolate between the earliest date and the latest date by months.
    • The last date is manually added (it might not be a full month to the last date in the data frame) so that we can also incorporate the most recent information as well.
  • We want to generate a set of intervals (ordered pairs) of months, e.g from Jan to Feb, Feb to Mar, Mar to Apr..
    • A simple for-loop takes care of this and stores the info in intervals.
dates <- seq(min(data$date),max(data$date),by="months")
dates <- dates %>% append(max(data$date))

intervals <- list() 
for (i in 2:length(dates)){
  intervals[[i-1]] <- c(dates[i-1],dates[i])
}
  • R has built-in geographical data that represents each state as a set of polygons by lat and long.
    • We will use this to plot the US map.
states_map <- map_data("state")
  • Here are the functions that plot a map for one specific interval.
    • It works by filtering out all observations in the data in that interval, grouping by state, and then getting the cumulative increase in each state over that interval of time.
    • Then, that data is joined with the map data to produce a plot using ggplot.
    • The gradient scale is then set at an absolute number to ensure that the scale is consistent across all plots produced.
      • This will make sure we can compare maps across time.
    • The map is titled accordingly using the paste function so we know which date interval it is.
    • The plot is cleaned by removing some unnecessary elements.
heatmap_for_givenday_deathIncrease <- function(interval){
  cumul_increase <- data %>% 
    filter(between(date,interval[1],interval[2])) %>% 
    group_by(state) %>% 
    summarize(total=sum(deathIncrease))
  #cumul_increase$statename <-  tolower(state.name[match(cumul_increase$state,state.abb)])
  loc_data <- left_join(states_map,cumul_increase,by=c("region"="state"))
  # Used to see the max value in order to set the gradient. 
  #print(max(cumul_increase$total))
  p <- loc_data %>% 
    ggplot() + 
    geom_polygon(aes(x=long,y=lat,group=group,fill=total),color="grey") +
    scale_fill_gradient(low="white",high="red", limits=c(0,22136)) +
    ggtitle(paste("New deaths: ",interval[1]," to ",interval[2])) +
    theme(axis.title=element_blank(),
          axis.ticks=element_blank(),
          panel.grid=element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank())
  return(p)
}

heatmap_for_givenday_positiveIncrease <- function(interval){
  cumul_increase <- data %>% 
    filter(between(date,interval[1],interval[2])) %>% 
    group_by(state) %>% 
    summarize(total=sum(positiveIncrease))
  #cumul_increase$statename <-  tolower(state.name[match(cumul_increase$state,state.abb)])
  loc_data <- left_join(states_map,cumul_increase,by=c("region"="state"))
  #loc_data <- left_join(states_map,cumul_increase,by=c("region"="statename"))
 
  # Used to see the max value in order to set the gradient. 
  #print(max(cumul_increase$total))
  
  
  p <- loc_data %>% 
    ggplot() + 
    geom_polygon(aes(x=long,y=lat,group=group,fill=total),color="grey") +
    #scale_fill_gradient(low="white",high="red",limits=c(0,1259607)) +
    scale_fill_gradient(low="white",high="red",limits=c(0,1259607)) +
    ggtitle(paste("New cases: ",interval[1]," to ",interval[2])) +
        theme(axis.title=element_blank(),
          axis.ticks=element_blank(),
          panel.grid=element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank())
  return(p)
}
  • We can use map from purrr to apply our function readily across our list of intervals, making the process of generating each plot very straightforward.
    • A sample plot is displayed below (there are too many to include in this document).
    • They are exported by running each line again manually.
plots <- map(intervals,heatmap_for_givenday_deathIncrease)
plots <- map(intervals,heatmap_for_givenday_positiveIncrease)
plots[[12]]

  • These maps can be exported batch-by-batch using the code below.
for (i in 1:length(plots)){
  ggsave(plot=plots[[i]],path="../project/out",filename=paste("cases",i,".png",sep=""),width=5,height=4,device="png")
}
  • ffmpeg was then used to join the images together into a video.
    • The command needed to be adjusted slightly from batch to batch.
ffmpeg -r 1 -f image2 -s 1920x1080 -i ../deaths/project/out/cases/cases%d.png -vcodec libx264 -crf 25 -pix_fmt yuv420p test.mp4
  • The videos cannot be embedded into a PDF,

  • This whole procedure can also be replicated for each region of the US, giving us a better idea of regional trends as well.

    • A csv containing the region names for each state is imported and joined with the data.
us_regions <- read_csv("https://raw.githubusercontent.com/cphalpert/census-regions/master/us%20census%20bureau%20regions%20and%20divisions.csv")

data <- left_join(cases,deaths) %>% 
  filter(date >= date("2020-02-01"))

data <- left_join(data,us_regions,by=c("state"="State"))
  • The functions are adjusted slightly to allow filtering by region.
    • An upbound argument is added to make it easier to set the absolute scale of the gradient.
heatmap_for_givenday_positiveIncrease <- function(interval,region,upbound){
  cumul_increase <- data %>% 
    filter(Region == region) %>% 
    filter(between(date,interval[1],interval[2])) %>% 
    group_by(state) %>% #added when JHU data used
    mutate(state=tolower(state)) %>% 
    summarize(total=sum(positiveIncrease))
  #cumul_increase$statename <-  tolower(state.name[match(cumul_increase$state,state.abb)])
  #loc_data <- inner_join(states_map,cumul_increase,by=c("region"="statename"))
  loc_data <- inner_join(states_map,cumul_increase,by=c("region"="state"))
  
  print(max(cumul_increase$total))
  
  p <- loc_data %>% 
    ggplot() + 
    geom_polygon(aes(x=long,y=lat,group=group,fill=total),color="grey") +
    scale_fill_gradient(low="white",high="red",limits=c(0,upbound)) +
    ggtitle(paste(region,"- New cases: ",interval[1]," to ",interval[2])) +
        theme(axis.title=element_blank(),
          axis.ticks=element_blank(),
          panel.grid=element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank())
  return(p)
}

heatmap_for_givenday_deathIncrease <- function(interval,region,upbound){
  cumul_increase <- data %>% 
    filter(Region == region) %>% 
    filter(between(date,interval[1],interval[2])) %>% 
    group_by(state) %>% 
    mutate(state=tolower(state)) %>% 
    summarize(total=sum(deathIncrease))
  #cumul_increase$statename <-  tolower(state.name[match(cumul_increase$state,state.abb)])
  #loc_data <- inner_join(states_map,cumul_increase,by=c("region"="statename"))
  loc_data <- inner_join(states_map,cumul_increase,by=c("region"="state"))
  
  print(max(cumul_increase$total))
  
  p <- loc_data %>% 
    ggplot() + 
    geom_polygon(aes(x=long,y=lat,group=group,fill=total),color="grey") +
    scale_fill_gradient(low="white",high="red",limits=c(0,upbound)) +
    ggtitle(paste(region,"- New deaths: ",interval[1]," to ",interval[2])) +
        theme(axis.title=element_blank(),
          axis.ticks=element_blank(),
          panel.grid=element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank())
  return(p)
}
  • The maps are once again generated in a very similar procedure.
    • Below is an example regional map.
plots <- map(intervals,heatmap_for_givenday_positiveIncrease,"South",639843)
plots <- map(intervals,heatmap_for_givenday_positiveIncrease,"West",1259607)
plots <- map(intervals,heatmap_for_givenday_positiveIncrease,"Northeast",450056)
plots <- map(intervals,heatmap_for_givenday_positiveIncrease,"Midwest",322299)

plots <- map(intervals,heatmap_for_givenday_deathIncrease,"South",10159)
plots <- map(intervals,heatmap_for_givenday_deathIncrease,"West",15922)
plots <- map(intervals,heatmap_for_givenday_deathIncrease,"Northeast",22136)
plots <- map(intervals,heatmap_for_givenday_deathIncrease,"Midwest",6538)
plots[[12]]

  • A similar code and bash command is used to generate animations.
for (i in 1:length(plots)){
  ggsave(plot=plots[[i]],path="../project/out",filename=paste("northeast_cases",i,".png",sep=""),width=5,height=4,device="png")
}
  • National cases:
  • National deaths:
  • Midwest cases
  • Midwest deaths
  • Northeast cases
  • Northeast deaths
  • South cases
  • South deaths
  • West cases
  • West deaths

Part 2: Spatio-temporal clustering of cases and deaths across the contiguous US at the state level

# Make analysis reproducible
set.seed(2)
library(tidyverse)
library(lubridate)
library(cluster)
library(rvest)
library(patchwork)
JHUdata_cases <- read_csv("https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")

JHUdata_deaths <- read_csv("https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv")

first_diff <- function(arr){
  to_return <- NULL
  for (i in 2:length(arr)){
   to_return[i] <- arr[i] - arr[i-1]
  }
  return(to_return)
}

add_firstdiff <- function(df){
  unique_states <- unique(df$state)
  dfs <- list()
  for (this_state in unique_states){
    thisdf <- df %>% filter(state==this_state)
    thisdf$diff <- first_diff(thisdf$cumul)
    dfs[[length(dfs) + 1]] <- thisdf
  }
  return(dfs %>% reduce(full_join))
}

cases <- JHUdata_cases %>% pivot_longer(matches("[0-9]+/[0-9]+/[0-9]+"),names_to="date",values_to="cumul") %>% 
  rename(c("state"="Province_State")) %>% 
  mutate(date=mdy(date)) %>% 
  group_by(date,state) %>% summarize(cumul=sum(cumul))

deaths <- JHUdata_deaths %>% pivot_longer(matches("[0-9]+/[0-9]+/[0-9]+"),names_to="date",values_to="cumul") %>% 
  rename(c("state"="Province_State")) %>% 
  mutate(date=mdy(date)) %>% 
  group_by(date,state) %>% summarize(cumul=sum(cumul))


cases <- add_firstdiff(cases) %>% select(date,state,diff) %>% rename(c("positiveIncrease"="diff"))
deaths <- add_firstdiff(deaths) %>% select(date,state,diff) %>% rename(c("deathIncrease"="diff"))

data <- left_join(cases,deaths) %>% 
  ungroup() %>%
    filter(state %in% state.name) %>% 
    filter(state != "Alaska") %>%
    filter(state != "Hawaii")
data %>% head()
data %>% ggplot(aes(x=date,y=positiveIncrease)) + geom_line(aes(color=state),alpha=0.3)

content <- read_html("https://inkplant.com/code/state-latitudes-longitudes")
states_latlong <- html_table(content,header=T)
states_latlong <- as_tibble(data.frame(states_latlong))
states_latlong <- states_latlong %>% rename(c("state"="State","lat"="Latitude","long"="Longitude"))
# Check how the states distribute
# Looks like the capital cities of each state, that's fine
states_map <- map_data("state")
ggplot(states_map, aes(long, lat)) +
  geom_polygon(aes(group = group),fill = "white", colour = "black") +
  geom_point(data=states_latlong)

data <- data %>% left_join(states_latlong)
data %>% head()

Min-max: To scale a vector \(X\), \(x_{i_{scaled}} = \frac{x_i - min(X)}{max(X) - min(X)}\) for each \(x_i\) in \(X\).

# Function to normalize a vector
norm <- function(my_vec){
  my_max <- max(my_vec)
  my_min <- min(my_vec)
  if (my_max-my_min ==0){
    print("err: divide by zero")
  }
  return((my_vec - my_min)/(my_max-my_min))
}
temp <- data %>% filter(!is.na(positiveIncrease) & !is.na(deathIncrease)) 
cor(temp$positiveIncrease,temp$deathIncrease)
## [1] 0.6295024
# Prepare the data for clustering, i.e
# remove NAs,
# and normalize features
cluster_deaths <- data %>% 
filter(
  !is.na(positiveIncrease) & 
  !is.na(deathIncrease) &
  !is.na(lat) & 
  !is.na(long)) %>%
mutate(date=as.integer(date)) %>%
mutate(date=norm(date)) %>%
mutate(
  positiveIncrease=norm(positiveIncrease),
  deathIncrease=norm(deathIncrease),
  lat=norm(lat),
  long=norm(long)
) %>% select(-state,-positiveIncrease)

cluster_cases <- data %>% 
filter(
  !is.na(positiveIncrease) & 
  !is.na(deathIncrease) &
  !is.na(lat) & 
  !is.na(long)) %>%
mutate(date=as.integer(date)) %>%
mutate(date=norm(date)) %>%
mutate(
  positiveIncrease=norm(positiveIncrease),
  deathIncrease=norm(deathIncrease),
  lat=norm(lat),
  long=norm(long)
) %>% select(-state,-deathIncrease)


cluster_deaths %>% head()
cluster_cases %>% head()
# Generate an elbow plot to determine good value of k
plot_elbow <- function(data,up_to){
  elbow <- unlist(map(1:up_to, function(k){
    clusters <- kmeans(data,k,50)
    return (clusters$tot.withinss)
  }))
  data.frame(k=1:up_to,elbow) %>% ggplot(aes(x=k,y=elbow)) +
    geom_line() + geom_point(size=5) +
    scale_x_continuous(breaks=1:20)
}
plot_elbow(cluster_deaths,15) + ggtitle("Cases") + 
  plot_elbow(cluster_cases,15) + ggtitle("Deaths")

clusters_deaths_3 <- kmeans(cluster_deaths,centers=3,50)
clusters_cases_3 <- kmeans(cluster_cases,centers=3,50)

clusters_deaths_6 <- kmeans(cluster_deaths,centers=6,50)
clusters_cases_6 <- kmeans(cluster_cases,centers=6,50)
# Add the cluster IDs to the data.
add_clust_nums <- function(clust_out){
  data_w_clust <- data %>% 
  filter(
    !is.na(positiveIncrease) & 
    !is.na(deathIncrease) &
    !is.na(lat) & 
    !is.na(long))
  data_w_clust$clust_num <- clust_out$cluster
  return(data_w_clust)
}

wclust_deaths_3 <- add_clust_nums(clusters_deaths_3)
wclust_deaths_6 <- add_clust_nums(clusters_deaths_6)
wclust_cases_3<- add_clust_nums(clusters_cases_3)
wclust_cases_6 <- add_clust_nums(clusters_cases_6)
days <- seq(min(data$date),max(data$date),by="months")
days[length(days)+1] <- max(data$date)

colors <- c("1"="#E69F00", "2" = "#56B4E9", "3" = "#009E73", "4" = "#F0E442", "5" ="#0072B2", "6" ="#D55E00")


plot_clust_timeseries <- function(w_clust_i,name){
  loc_data <-  w_clust_i %>% group_by(date,clust_num) %>%
    summarize(cases=sum(positiveIncrease), deaths=sum(deathIncrease)) 
  
  p1 <- loc_data %>% ggplot(aes(x=date,y=cases)) + 
    geom_smooth(aes(color=factor(clust_num))) + ggtitle(paste("Time series of cases by cluster:",name)) +
    scale_x_continuous(breaks=days[c(T,F)]) +
    scale_color_manual(values=colors)
    
  p2 <- loc_data %>% ggplot(aes(x=date,y=deaths)) + 
    geom_smooth(aes(color=factor(clust_num))) + ggtitle(paste("Time series of deaths by cluster:",name)) +
    scale_x_continuous(breaks=days[c(T,F)]) +
    scale_color_manual(values=colors)
  return(p1 / p2)
}
plot_clust_timeseries(wclust_deaths_3,"deaths_3")

plot_clust_timeseries(wclust_deaths_6,"deaths_6")

plot_clust_timeseries(wclust_cases_3,"cases_3")

plot_clust_timeseries(wclust_cases_6,"cases_6")

- Within-cluster trends

wclust_cases_3 %>%  ggplot(aes(x=date,y=positiveIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in positive cases: cases_3")

wclust_cases_3 %>% ggplot(aes(x=date,y=deathIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in deaths: cases_3")

wclust_cases_6 %>% ggplot(aes(x=date,y=positiveIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in positive cases: cases_6")

wclust_cases_6 %>% ggplot(aes(x=date,y=deathIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in deaths: cases_6")

wclust_deaths_3 %>% ggplot(aes(x=date,y=positiveIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in positive cases: deaths_3")

wclust_deaths_3 %>% ggplot(aes(x=date,y=deathIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in deaths: deaths_3")

wclust_deaths_6 %>% ggplot(aes(x=date,y=positiveIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in positive cases: deaths_6")

wclust_deaths_6 %>% ggplot(aes(x=date,y=deathIncrease,color=state,alpha=0.2)) + geom_smooth() + facet_wrap(~clust_num) + ggtitle("Within-cluster time-series, increase in deaths: deaths_6")

states_map <- map_data("state")
cluster_plot_for_day <- function(day,data_w_clust,name){
  temp <- data_w_clust %>% filter(date==day) %>%
    mutate(state=tolower(state)) %>% select(-lat,-long)
  loc_data <- left_join(states_map,temp,by=c("region"="state"))

  loc_data %>% 
    ggplot() + 
    geom_polygon(aes(x=long,y=lat,group=group,fill=factor(clust_num)),color="grey") +
    scale_fill_manual(values=colors) +
    ggtitle(paste("Location of clusters",day,name,sep=" - ")) +
    theme(axis.title=element_blank(),
          axis.ticks=element_blank(),
          panel.grid=element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_blank())
}
cluster_plot_for_day("2020-07-01",wclust_cases_3,"cases_3")

cluster_plot_for_day("2021-01-01",wclust_cases_3,"cases_3")

cluster_plot_for_day("2020-07-01",wclust_cases_6,"cases_6")

cluster_plot_for_day("2021-01-01",wclust_cases_6,"cases_6")

cluster_plot_for_day("2020-07-01",wclust_deaths_3,"cases_3")

cluster_plot_for_day("2021-01-01",wclust_deaths_3,"deaths_3")

cluster_plot_for_day("2020-07-01",wclust_deaths_6,"deaths_6")

cluster_plot_for_day("2021-01-01",wclust_deaths_6,"deaths_6")

Analysis and discussion

At the national level, the spread of COVID is indeed generally characterized a spread from East coast/Northeast states and West coast states through the entire country. Large spikes of cases continue to be concentrated in larger, populous states like New York and California, although strong increases are also observed in Southern and Midwestern states.

  1. The state of New York was a major COVID hotspot during the earlier periods of the pandemic, exhibiting both a high case and death load. New cases and new deaths reached their highest point in this wave from 2020-04-01 to 2020-05-01. The observed death load in this month was one of the highest ever new deaths during the entire pandemic in the contiguous US thus far. At a glance, it seems to be disproportionately high (in terms of the confirmed death/case ratio) compared to the rest of the pandemic.
  1. Both cases and deaths began to rise in California, Texas, and Florida around 2020-08 to 2020-09. By this point, almost all states are seeing new cases. There is a small decrease in the new number of cases in these 3 states from 2020-09 to 2020-11.

  2. There is a dramatic, almost uniform increase in cases across the entire nation from 2020-11 through 2021-03, Some of the largest increases (Roughly in descending order), are in California, Texas, New York, Florida, Illinois, Ohio, Pennsylvania. Also of note are Arizona, Georgia, North Carolina, and Tennessee. Almost all of these increases are accompanied with roughly proportional rises in deaths.

  1. New cases begin to decrease from 03-2021 to 04-2021. At this point, the data ends, so it is not possible to say what much about what happens in this time frame.

Region by region,

Midwest: - Illinois, Michigan see the worst spike in deaths from 2020-05 to 2020-06. - Coinciding with the nationwide increase from 2020-11 to 2021-01, Illinois, Ohio, Michigan, and Indiana see large rises in cases and deaths. The rest of the Midwest (Excluding North Dakota, South Dakota, Nebraska) all see a similar rise in cases to each other (that is less than that of Ohio and Illinois). - Cases decrease in almost all Midwestern states from 03-2021 to 04-2021, except for Michigan.

Northeast: - Corresponding to the national map, New York sees considerable rises in cases and deaths. - when New York rises in cases, it is typically accompanied by a rise in its adjacent states, like Pennsylvania, New Jersey, and Massachusetts. - Compared to the rest of the Northeast, Maine, New Hampshire, and Vermont do not see as large (raw count) of a rise in cases.

South: - Corresponding with the trends in the national map, Texas and Florida dominate much of the increases in the South. - The spread throughout the rest of the South is relatively uniform in the sense that during periods of increased spread, most Southern states see similar magnitudes of increases. - However, during the 2020-12 to 2021-03 wave, cases and deaths seemed to be more concentrated in Georgia, Tennessee, North Carolina, and Virginia.

West: - California more or less dominates most of the case spread in the West over the entire range of available time. - Arizona does also at times see considerable rises in deaths and cases as well. It is probably the “2nd hotspot” in the West.

Cluster analysis:

Investigating the “time series by cluster” plots: - All clustering runs seemed to have created the same temporal slice at around 2020-08. We see a peak before this date and a peak after, indicating that this clustering algorithm probably thinks this is about the right date to delineate the “first wave” and the “second wave.” - This lines up with what we saw in the heatmaps, as we saw the onset of a large increase in cases around that time.

Investigating the within-cluster plots, - the 3-cluster runs seem to have middling results with respect to the time series: in cluster 1, for example, California appears to have had such a large increase in cases that it essentially gets its own cluster. However, the trends of all states within each cluster seem to generally correspond to one another.

The trends observed in the cluster analysis correspond well to those observed using heatmapping, providing extra support for the observed patterns.

These patterns are an excellent stepping stone to subsequent analyses of state-level COVID data. Since this analysis is fundamentally exploratory, it has generated several questions that would be of interest to test in a confirmatory manner. - e.g Since this analysis has suggested that grouping states by their geographical region is indeed logical, we may wish to test that hypothesis with a statistical test (i.e investigate if there is indeed a group difference between regions) -

Ranking regions by record high monthly case volume in ascending order,

region monthly record high case increase
Midwest 312,326
Northeast 404,981
South 639,843
West 1,249,607

Ranking top 10 states by record high monthly case volume,

Investigating hotspots by region,

National trends,

Regional trends,

The cases map:

Deaths map: - 2020-03-13 to 2020-04-13 - New York has the worst count of deaths by far. Other states in the Northeast, California are seeing rises as well. - Deaths continue to worsen in New York through the next month.

Overall at the national level, - New York was a hotspot of COVID activity during the earlier period of the pandemic in the US (from March 2020 to June 2020.) - Relatively speaking, deaths were very high during this period. - New cases and deaths slow from 05 2020 to 06-2020. - Then, in 07-2020 to 08-2020, New cases and deaths begin to increase especially in - California, - Texas, - and Florida. - This continues for a bit until 10-2020, where new cases and deaths begin to slow again.

Qualitatively speaking, the deaths and the cases charts coincide quite well. - It is known that there is a “lag time” between increases in cases and increases in deaths. This may be due to the time required to report the data as well as the time required for cases to develop. - This is especially evident from December 2020 to January 2021. The cases hit their peak in California in December, but the deaths do not peak until the month after, even though new cases in California go down during that period.